Most of the plots are interactive: you can click or zoom to get more details. Don't hesitate to click on a plot; it will zoom automatically.
# Loading Packages
library(data.table) # Efficient Dataframe
library(lubridate) # For Dates
library(tidyverse)
library(esquisse) # Intuitive plotting
library(plyr)
library(ggplot2) # Plot Graphs
library(naniar) #for NA exploration in Dataframe
library(sp) #spatial data
library(plotly) # Make ggplot2 Dynamic
library(gissr) # Spatial Transformations
library(leaflet) # For Map
library(leaflet.providers) # For Custom Icons
library(geosphere) # Spatial Calculations
library(DT) # Render Table in a explorable UI
library(gridExtra)
library(corrplot) # Correlation Plot
library(RColorBrewer) # For Color Palette
library(rmdformats) # Theme of HTML
These are the required packages.
Geosphere: Spherical trigonometry for geographic applications. That is, compute distances and related measures for angular (longitude/latitude) locations.
Gissr: gissr is a collection of R functions which make working with spatial data easier.
Loading the dataset called “LaptopSales_red.csv” given for the Homework
FALSE Classes 'data.table' and 'data.frame': 148786 obs. of 17 variables:
FALSE $ V1 : int 171289 38634 260048 166045 243280 118859 249957 198058 198850 267007 ...
FALSE $ Date : chr "9/20/2008 2:49" "5/30/2008 9:52" "12/10/2008 9:26" "9/15/2008 9:41" ...
FALSE $ Configuration : int 528 307 235 168 517 738 301 301 479 472 ...
FALSE $ Customer.Postcode : chr "NW5 1SP" "N6 6BU" "CR0 2BW" "WC2H 9PS" ...
FALSE $ Store.Postcode : chr "N3 1DH" "N3 1DH" "CR7 8LE" "SW1P 3AU" ...
FALSE $ Retail.Price : int 413 515 315 NA 580 535 455 465 600 392 ...
FALSE $ Screen.Size..Inches. : int 17 15 15 15 17 17 15 15 17 17 ...
FALSE $ Battery.Life..Hours. : int 4 6 5 5 4 6 6 6 4 4 ...
FALSE $ RAM..GB. : int 2 1 2 1 2 1 1 1 1 1 ...
FALSE $ Processor.Speeds..GHz.: num 2.4 2 2.4 2 2.4 2 1.5 1.5 2.4 2.4 ...
FALSE $ Integrated.Wireless. : chr "No" "Yes" "No" "Yes" ...
FALSE $ HD.Size..GB. : int 300 80 80 300 120 40 120 120 300 300 ...
FALSE $ Bundled.Applications. : chr "No" "Yes" "Yes" "No" ...
FALSE $ customer.X : int 528771 528281 532781 530190 537350 532498 533130 529390 533998 532498 ...
FALSE $ customer.Y : int 186041 187336 166444 181139 169306 168334 182489 181270 168421 168334 ...
FALSE $ store.X : int 525109 525109 532714 529902 528739 528739 534057 528924 528739 532714 ...
FALSE $ store.Y : int 190628 190628 168302 179641 173080 173080 179682 178440 173080 168302 ...
FALSE - attr(*, ".internal.selfref")=<externalptr>
Retail Price is the only variable with missing values, at a rate of approximately 4%.
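A missing-value rate like the one above can be checked with a few lines of base R. This is only a minimal sketch: the toy data frame `laptops_toy` is a hypothetical name, and its values are the first five `Retail.Price` and `Configuration` entries shown in the `str` output, not the full dataset.

```r
# Toy data frame built from the first five rows shown in the str() output
# (hypothetical object name, used only for illustration)
laptops_toy <- data.frame(
  Retail.Price  = c(413, 515, 315, NA, 580),
  Configuration = c(528, 307, 235, 168, 517)
)

# Share of missing values per column, in percent
missing_pct <- colMeans(is.na(laptops_toy)) * 100
print(missing_pct)
```

On the full data, the same call would be applied to the loaded data frame instead of the toy one.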
This barplot shows the most frequent retail prices across all stores in 2008. The median is shown in black.
This boxplot summarizes the retail prices of the 2008 laptop dataset around their median; click on the white dot to get the mean.
## [1] "Last Recorded Prices are 406 USD and 530 USD on the same Day with a mean of 468 USD"
These are the last recorded prices for 2008.
These plots show different aggregation levels; which one to use depends on the analysis we want and the granularity it needs.
Each boxplot belongs to a specific store; we can see a common trend across all stores in 2008.
Looking at the time series, we can see that most stores, though not all, share the same time trend.
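To make the idea of aggregation levels concrete, here is a minimal base R sketch. It assumes the `Date` column uses a month/day/year format, as the `str` output suggests, and it uses only the first four rows shown there; `sales_toy`, `monthly` and `store_monthly` are hypothetical names. The report itself loads lubridate and data.table, which could do the same job.

```r
# Toy rows taken from the first lines of the str() output
sales_toy <- data.frame(
  Date           = c("9/20/2008 2:49", "5/30/2008 9:52", "12/10/2008 9:26", "9/15/2008 9:41"),
  Store.Postcode = c("N3 1DH", "N3 1DH", "CR7 8LE", "SW1P 3AU"),
  Retail.Price   = c(413, 515, 315, NA)
)

# Parse the month/day/year timestamps, then derive a coarser time key
sales_toy$timestamp <- as.POSIXct(sales_toy$Date, format = "%m/%d/%Y %H:%M", tz = "UTC")
sales_toy$month <- format(sales_toy$timestamp, "%Y-%m")

# Coarse level: mean price per month (rows with a missing price are dropped)
monthly <- aggregate(Retail.Price ~ month, data = sales_toy, FUN = mean)

# Finer level: mean price per store and month
store_monthly <- aggregate(Retail.Price ~ Store.Postcode + month, data = sales_toy, FUN = mean)

print(monthly)
print(store_monthly)
```

The coarse table is easier to read as a global trend, while the per-store table is the one to use when comparing stores against each other.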
FALSE `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Using a smoother, we can see distinct trends: first a rapid increase in price at low configurations, then a slope that stays low and roughly constant, and finally another increase for the highest configurations.
Enjoy looking at each store and customer in London, UK. You can find their exact location by clicking on them.
transform_coordinates: A convenient function from gissr (on GitHub) that builds on the CRAN spTransform function but works directly on coordinates stored in a data frame and returns a data frame. The spTransform methods provide transformation between datum(s) and conversion between projections (also known as projection and/or re-projection), from one unambiguously specified coordinate reference system (CRS) to another, prior to version 1.5 using Proj4 projection arguments.
The following barplots show two ways of analyzing the stores' sales results: by the number of transactions or by the sales revenue each store generated during 2008.
With this plot we can see the distance between Customers and Stores in terms of latitude and longitude.
distHaversine: The shortest distance between two points (i.e., the ’great-circle-distance’ or ’as the crow flies’), according to the ’haversine method’. This method assumes a spherical earth, ignoring ellipsoidal effects. The Haversine (’half-versed-sine’) formula was published by R.W. Sinnott in 1984, although it has been known for much longer. At that time computational precision was lower than today (15 digits precision). With current precision, the spherical law of cosines formula appears to give equally good results down to very small distances.
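For readers who want to see the formula itself, here is a minimal base R sketch of the haversine distance. It is not the geosphere source code: `haversine_m` is a hypothetical helper, and the default radius of 6378137 m is assumed to match the default used by geosphere.

```r
# Haversine great-circle distance in metres between two lon/lat points (degrees).
# Hypothetical helper for illustration; r is the assumed sphere radius in metres.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# A quarter of the equator: 90 degrees of longitude at latitude 0
quarter_equator <- haversine_m(0, 0, 90, 0)
print(quarter_equator)
```

Because the function is vectorised, it can be applied to whole columns of customer and store coordinates once they are expressed in longitude and latitude.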
Each unique customer can be found here; scroll down to see the distance they need to travel to get to their store.
Histogram of the Distance between Clients and Stores
This plot shows each store's share of the total revenue in 2008.
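A revenue share of this kind can be sketched in base R as follows. The store postcodes come from the `str` output, but the prices are hypothetical round numbers chosen so the shares are easy to read, and `revenue_toy` and `revenue_share` are illustrative names.

```r
# Hypothetical prices for two of the stores listed in the data
revenue_toy <- data.frame(
  Store.Postcode = c("N3 1DH", "N3 1DH", "CR7 8LE"),
  Retail.Price   = c(300, 500, 200)
)

# Total revenue per store, ignoring missing prices
revenue_by_store <- sapply(
  split(revenue_toy$Retail.Price, revenue_toy$Store.Postcode),
  sum, na.rm = TRUE
)

# Each store's share of the total revenue, in percent
revenue_share <- revenue_by_store / sum(revenue_by_store) * 100
print(revenue_share)
```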
We can see that SW1P 3AU offers higher configurations while having the smallest share of the company's total revenue. This could be because it sells higher-priced configurations and therefore sells fewer units during the year, to a smaller pool of clients who want a better PC for more demanding computing work.
With these faceted barplots, you can spot which configurations sell poorly or not at all in each store.
Loading the dataset called “Cereals.csv” given for the Homework
FALSE Classes 'data.table' and 'data.frame': 77 obs. of 16 variables:
FALSE $ name : chr "100%_Bran" "100%_Natural_Bran" "All-Bran" "All-Bran_with_Extra_Fiber" ...
FALSE $ mfr : chr "N" "Q" "K" "K" ...
FALSE $ type : chr "C" "C" "C" "C" ...
FALSE $ calories: int 70 120 70 50 110 110 110 130 90 90 ...
FALSE $ protein : int 4 3 4 4 2 2 2 3 2 3 ...
FALSE $ fat : int 1 5 1 0 2 2 0 2 1 0 ...
FALSE $ sodium : int 130 15 260 140 200 180 125 210 200 210 ...
FALSE $ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
FALSE $ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
FALSE $ sugars : int 6 8 5 0 8 10 14 8 6 5 ...
FALSE $ potass : int 280 135 320 330 NA 70 30 100 125 190 ...
FALSE $ vitamins: int 25 0 25 25 25 25 25 25 25 25 ...
FALSE $ shelf : int 3 3 3 3 3 1 2 3 1 3 ...
FALSE $ weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
FALSE $ cups : num 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
FALSE $ rating : num 68.4 34 59.4 93.7 34.4 ...
FALSE - attr(*, ".internal.selfref")=<externalptr>
Ordinal: shelf, rating
Nominal: name, mfr, type
Quantitative/Numerical: calories, protein, fat, sodium, sugars, potass, weight, cups, vitamins, fiber, carbo
Summary
FALSE name mfr type calories
FALSE Length:74 Length:74 Length:74 Min. : 50
FALSE Class :character Class :character Class :character 1st Qu.:100
FALSE Mode :character Mode :character Mode :character Median :110
FALSE Mean :107
FALSE 3rd Qu.:110
FALSE Max. :160
FALSE protein fat sodium fiber carbo
FALSE Min. :1.000 Min. :0 Min. : 0.0 Min. : 0.000 Min. : 5.00
FALSE 1st Qu.:2.000 1st Qu.:0 1st Qu.:135.0 1st Qu.: 0.250 1st Qu.:12.00
FALSE Median :2.500 Median :1 Median :180.0 Median : 2.000 Median :14.50
FALSE Mean :2.514 Mean :1 Mean :162.4 Mean : 2.176 Mean :14.73
FALSE 3rd Qu.:3.000 3rd Qu.:1 3rd Qu.:217.5 3rd Qu.: 3.000 3rd Qu.:17.00
FALSE Max. :6.000 Max. :5 Max. :320.0 Max. :14.000 Max. :23.00
FALSE sugars potass vitamins shelf
FALSE Min. : 0.000 Min. : 15.00 Min. : 0.00 Min. :1.000
FALSE 1st Qu.: 3.000 1st Qu.: 41.25 1st Qu.: 25.00 1st Qu.:1.250
FALSE Median : 7.000 Median : 90.00 Median : 25.00 Median :2.000
FALSE Mean : 7.108 Mean : 98.51 Mean : 29.05 Mean :2.216
FALSE 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00 3rd Qu.:3.000
FALSE Max. :15.000 Max. :330.00 Max. :100.00 Max. :3.000
FALSE weight cups rating
FALSE Min. :0.500 Min. :0.2500 Min. :18.04
FALSE 1st Qu.:1.000 1st Qu.:0.6700 1st Qu.:32.45
FALSE Median :1.000 Median :0.7500 Median :40.25
FALSE Mean :1.031 Mean :0.8216 Mean :42.37
FALSE 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:50.52
FALSE Max. :1.500 Max. :1.5000 Max. :93.70
Standard Deviations
FALSE name mfr type calories protein fat sodium
FALSE NA NA NA 19.8438928 1.0758016 1.0068260 82.7697871
FALSE fiber carbo sugars potass vitamins shelf weight
FALSE 2.4233912 3.8916746 4.3591113 70.8786815 22.2943521 0.8320674 0.1534155
FALSE cups rating
FALSE 0.2357153 14.0337125
Histogram of Quantitative Variables
Standard Deviations
FALSE calories protein fat sodium sugars potass weight
FALSE 19.4841191 1.0947897 1.0064726 83.8322952 4.3786564 70.4106360 0.1504768
FALSE cups vitamins fiber carbo
FALSE 0.2327161 22.3425225 2.3833640 3.9073256
Based on the histogram grid and the summary of standard deviations, sodium, potass and vitamins have the largest variability.
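The spread figures reported above appear to be of the kind returned by `sd()`, that is, per-variable standard deviations rather than standard errors of the mean. A minimal sketch of the difference, using only the first ten `calories` values from the `str` output (so the numbers will not match the full-data figures):

```r
# First ten calories values shown in the str() output
calories <- c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90)

# Standard deviation: spread of the individual observations
sd_cal <- sd(calories)

# Standard error of the mean: uncertainty of the sample mean, shrinks with n
se_cal <- sd_cal / sqrt(length(calories))

print(c(sd = sd_cal, se = se_cal))
```

When comparing variability across variables measured in different units, the standard deviation is the relevant quantity, but it remains scale-dependent.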
Potassium, fiber and fat seem skewed; sugars could be as well.
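The visual impression of skewness can be backed by a number. Below is a minimal sketch of the moment-based sample skewness in base R; `sample_skewness` is a hypothetical helper, and the input is only the first ten `fiber` values from the `str` output, so the result illustrates the computation rather than the full-data value.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2 (hypothetical helper)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten fiber values shown in the str() output
fiber <- c(10, 2, 9, 14, 1, 1.5, 1, 2, 4, 5)
fiber_skew <- sample_skewness(fiber)
print(fiber_skew)
```

A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.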
We can see that fiber has at least two extreme values (two bins away from the main cluster), as do calories, vitamins and weight. Cups could also have extreme values.
We lack data on hot cereals, so we cannot compare the two cereal types.
Shelves 1 and 3 are pretty close (both medians are around 40-42), so we could merge categories 1 and 3 into a single category and predict with it alongside category 2.
We could identify the most highly correlated variables and remove some of them, because of the threat of multicollinearity. In the context of a regression, computing the VIF on our model would suggest which explanatory variables to remove, in line with this correlation table.
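As an illustration of the VIF idea, here is a minimal base R sketch on simulated data, not on the cereals dataset. Each predictor is regressed on the others, and its VIF is 1 / (1 - R²); the same values can be read off the diagonal of the inverse correlation matrix. The variables `x1`, `x2`, `x3` are hypothetical, and a dedicated package function could be used instead of this manual version.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # strongly correlated with x1 by construction
x3 <- rnorm(n)                 # unrelated to x1 and x2
predictors <- data.frame(x1, x2, x3)

# VIF of each predictor: regress it on the remaining predictors
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(predictors)))

print(round(vif_manual, 2))
```

In this simulated setup `x1` and `x2` get large VIF values while `x3` stays close to 1, which is the pattern to look for in highly correlated pairs of a correlation matrix.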
Correlation Matrix
FALSE calories protein fat sodium sugars
FALSE calories 1.00000000 0.03399166 0.5073732397 0.2962474981 0.569120535
FALSE protein 0.03399166 1.00000000 0.2023533963 0.0115588913 -0.286583967
FALSE fat 0.50737324 0.20235340 1.0000000000 0.0008219036 0.287152487
FALSE sodium 0.29624750 0.01155889 0.0008219036 1.0000000000 0.037058961
FALSE sugars 0.56912054 -0.28658397 0.2871524866 0.0370589612 1.000000000
FALSE potass -0.07136125 0.57874284 0.1996367171 -0.0394380876 0.001413982
FALSE weight 0.69645215 0.23067141 0.2217141647 0.3125335701 0.460547135
FALSE cups 0.08919615 -0.24209861 -0.1575787041 0.1195841083 -0.032436100
FALSE vitamins 0.25984556 0.05479952 -0.0305139099 0.3315759640 0.072954382
FALSE fiber -0.29521183 0.51400610 0.0140358654 -0.0707349230 -0.150948502
FALSE carbo 0.27060605 -0.03674326 -0.2849336855 0.3284091857 -0.452069189
FALSE potass weight cups vitamins fiber
FALSE calories -0.071361247 0.6964521 0.08919615 0.25984556 -0.29521183
FALSE protein 0.578742837 0.2306714 -0.24209861 0.05479952 0.51400610
FALSE fat 0.199636717 0.2217142 -0.15757870 -0.03051391 0.01403587
FALSE sodium -0.039438088 0.3125336 0.11958411 0.33157596 -0.07073492
FALSE sugars 0.001413982 0.4605471 -0.03243610 0.07295438 -0.15094850
FALSE potass 1.000000000 0.4205615 -0.50168832 -0.00263583 0.91150392
FALSE weight 0.420561534 1.0000000 -0.20171465 0.32043480 0.24629218
FALSE cups -0.501688318 -0.2017146 1.00000000 0.13362965 -0.51369716
FALSE vitamins -0.002635830 0.3204348 0.13362965 1.00000000 -0.03871734
FALSE fiber 0.911503921 0.2462922 -0.51369716 -0.03871734 1.00000000
FALSE carbo -0.365002934 0.1448053 0.35828371 0.25357897 -0.37908370
FALSE carbo
FALSE calories 0.27060605
FALSE protein -0.03674326
FALSE fat -0.28493369
FALSE sodium 0.32840919
FALSE sugars -0.45206919
FALSE potass -0.36500293
FALSE weight 0.14480528
FALSE cups 0.35828371
FALSE vitamins 0.25357897
FALSE fiber -0.37908370
FALSE carbo 1.00000000
Nothing changes when we normalize the data before computing the correlation matrices and plots, since computing a correlation already standardizes the variables.
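This invariance is easy to verify. The sketch below uses the first ten `calories` and `sugars` values from the `str` output and two common normalizations; the helper names `z_score` and `min_max` are hypothetical. Pearson correlation is unchanged by any positive linear rescaling of either variable.

```r
# First ten values shown in the str() output
calories <- c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90)
sugars   <- c(6, 8, 5, 0, 8, 10, 14, 8, 6, 5)

# Two common normalizations (hypothetical helper names)
z_score <- function(x) (x - mean(x)) / sd(x)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

r_raw <- cor(calories, sugars)
r_z   <- cor(z_score(calories), z_score(sugars))
r_mm  <- cor(min_max(calories), min_max(sugars))

print(c(raw = r_raw, z_score = r_z, min_max = r_mm))
```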